1 Introduction

In this study, two highly correlated stock pairs from the BIST30 index are analyzed. First, a basic pairs trading strategy is examined under a constant variance assumption. Next, an advanced pairs trading strategy based on time series analysis is applied to the same pairs.

2 Highly Correlated Pairs

First, we need to choose two stock pairs that are strongly correlated.

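# Packages used throughout the report
library(tidyverse)   # dplyr, tidyr, ggplot2
library(lubridate)   # ymd_hms(), interval(), %within%
library(ggcorrplot)  # ggcorrplot()
library(qcc)         # qcc() control charts
library(forecast)    # checkresiduals()

pattern <- "\\d{8}_\\d{8}_bist30\\.csv"
matching_files <- list.files(path = "Data", pattern = pattern, full.names = TRUE)
# Create an empty data frame to accumulate the rows from every file
long_data <- data.frame("timestamp" = c(), "price" = c(), "short_name" = c())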

# Loop through each CSV file and read it into a data frame
for (file in matching_files) {
  data <- read.csv(file)
  data$timestamp <- ymd_hms(data$timestamp)
  data$short_name <- as.factor(data$short_name)
  long_data <- rbind(long_data, data)
}

The full data set is plotted below:

long_data %>%
  ggplot() +
  geom_line(aes(x = timestamp, y = price, color = short_name))

wide_data <- long_data %>%
  pivot_wider(names_from = short_name, values_from = price)

The full data set is too large to examine as a whole. For simplicity, we use only the data from 2018-01-01 to 2020-01-01, a period in which prices are less volatile than in the data after 2022.

We can plot the correlation matrix of the stocks, computed from their daily mean prices.

long_data %>%
  filter(timestamp %within% interval(ymd("2018-01-01"), ymd("2020-01-01"))) %>%
  group_by(Date = date(timestamp), short_name) %>%
  summarize(Price = mean(price), .groups = "drop") %>%
  pivot_wider(names_from = short_name, values_from = Price) %>%
  select(!Date) %>%
  cor(use = "complete.obs") %>%
  ggcorrplot(
    hc.order = TRUE,
    outline.col = "white",
    type = "upper", lab = TRUE,
    title = "Correlation Matrix of BIST30 Stocks Between 2018 and 2020",
    colors = c("darkred", "white", "darkgreen"),
    legend.title = "Correlation",
    ggtheme = theme_void
  )

From the correlation matrix, we pick the GARAN-AKBNK (correlation 0.97) and YKBNK-ISCTR (correlation 0.94) pairs for the pairs trading study.
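Reading the strongest pairs off the plot can also be done programmatically. The following is a minimal sketch on synthetic prices, not on the BIST30 data; the tickers `AAA` to `DDD` and all numeric settings are hypothetical and only illustrate the idea of ranking the upper triangle of a correlation matrix.

```r
# Synthetic prices for four hypothetical tickers (assumption: not BIST30 data).
set.seed(1)
n <- 250
common <- cumsum(rnorm(n))
prices <- data.frame(
  AAA = 10 + common + rnorm(n, sd = 0.2),
  BBB = 8 + 0.8 * common + rnorm(n, sd = 0.2),
  CCC = 5 + cumsum(rnorm(n)),
  DDD = 12 + cumsum(rnorm(n))
)

cor_mat <- cor(prices, use = "complete.obs")

# Read only the upper triangle so that each unordered pair appears once.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
pairs <- data.frame(
  stock_a = rownames(cor_mat)[idx[, "row"]],
  stock_b = colnames(cor_mat)[idx[, "col"]],
  correlation = cor_mat[idx]
)

# Sort from most to least correlated and show the two strongest pairs.
top_pairs <- pairs[order(-pairs$correlation), ]
head(top_pairs, 2)
```

With the real data, `prices` would be replaced by the wide table of daily mean prices that is passed to `cor()` above.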

3 Task 1: Basic Pairs Trading Strategy Using Constant Variance Assumption

3.1 GARAN-AKBNK

Here is the plot of GARAN and AKBNK from 2018 to 2020:

garan_akbnk_data <- wide_data %>%
  select(c(timestamp, GARAN, AKBNK)) %>%
  filter(timestamp %within% interval("2018-01-01", "2020-01-01"))

garan_akbnk_data %>%
  pivot_longer(cols = c("GARAN", "AKBNK"), names_to = "Stock", values_to = "Price") %>%
  ggplot() +
  geom_line(aes(x = timestamp, y = Price, color = Stock)) +
  labs(title = "GARAN and AKBNK Stocks from 2018 to 2020")

AKBNK and GARAN show a similar trend over time. To model their relationship, a linear regression model between GARAN and AKBNK is built.

model1 <- lm(formula = GARAN ~ AKBNK, data = garan_akbnk_data)

summary(model1)
## 
## Call:
## lm(formula = GARAN ~ AKBNK, data = garan_akbnk_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99924 -0.18489  0.00207  0.20937  0.90240 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.031388   0.029689   1.057     0.29    
## AKBNK       1.341477   0.004994 268.616   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3044 on 4971 degrees of freedom
## Multiple R-squared:  0.9355, Adjusted R-squared:  0.9355 
## F-statistic: 7.215e+04 on 1 and 4971 DF,  p-value: < 2.2e-16

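According to the summary, the slope on AKBNK (1.341) is statistically significant (p < 2e-16), while the intercept (0.031, p = 0.29) is not. The model explains about 94% of the variance in GARAN (R-squared 0.9355).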
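The residuals of this regression are what the trading rule monitors: they are the "spread" between the observed price and the price implied by the fitted line. The sketch below, on synthetic data with hypothetical series `x` and `y` (coefficients chosen only to resemble the fit above), checks that the spread built by hand from the estimated coefficients is identical to `residuals()`.

```r
# Synthetic stand-ins for two related price series (assumption: not real data).
set.seed(2)
n <- 500
x <- 7 + cumsum(rnorm(n, sd = 0.05))        # hypothetical explanatory stock
y <- 0.03 + 1.34 * x + rnorm(n, sd = 0.3)   # hypothetical response stock

fit <- lm(y ~ x)
beta0 <- coef(fit)[["(Intercept)"]]
beta1 <- coef(fit)[["x"]]

# Spread = observed price minus the price implied by the fitted line.
spread <- y - (beta0 + beta1 * x)

# The hand-built spread matches the model residuals.
all.equal(unname(residuals(fit)), spread)
```

Because the model includes an intercept, the spread has mean zero in the fitting sample, which is why the control chart is centered at zero.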

The residuals of the model are:

data.frame(index = 1:length(model1$residuals), residuals = model1$residuals) %>%
  ggplot() +
  geom_point(aes(x = index, y = residuals)) +
  labs(title= "Residuals of the linear regression model of GARAN and AKBNK")

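We can plot an X-bar control chart of the residuals to spot the outliers. Because each residual is a single observation, the chart for individual values (type = "xbar.one") is used. Here is the X-bar chart: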

qcc1 <- qcc(data = model1$residuals, type = "xbar.one", std.dev = "SD", nsigmas = 2)

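The standard deviation of the residuals is 0.304352, so the lower and upper 2-sigma limits are -0.608704 and 0.608704, respectively. The red points indicate the residuals that fall outside the limits. According to the pairs trading strategy used here, the stocks are traded when a point lies beyond the limits: when the residual is below the LCL, GARAN is sold and AKBNK is bought, and when it is above the UCL, the opposite trade is made. The calculation below trades one share of each stock at every signal and sums the prices received from sales minus the prices paid for purchases. Positions are not closed, so the figure is the net cash flow at the signal times. With this strategy, the profit becomes: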

garan_akbnk_data$SELL_GARAN_BUY_AKBNK <- model1$residuals < qcc1$limits[,"LCL"]
garan_akbnk_data$SELL_AKBNK_BUY_GARAN <- model1$residuals > qcc1$limits[,"UCL"]

income <-sum(garan_akbnk_data %>%
               filter(SELL_GARAN_BUY_AKBNK) %>%
               select(GARAN)) +
  sum(garan_akbnk_data %>%
        filter(SELL_AKBNK_BUY_GARAN) %>% select(AKBNK))

loss <-sum(garan_akbnk_data %>%
               filter(SELL_GARAN_BUY_AKBNK) %>%
                 select(AKBNK)) +
  sum(garan_akbnk_data %>%
        filter(SELL_AKBNK_BUY_GARAN) %>% 
        select(GARAN))

income-loss
## [1] 30.3665
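The limits used above can be reproduced without the chart. The sketch below assumes that the limits are the mean plus or minus `nsigmas` times the sample standard deviation, which is consistent with the values reported in the text (2 × 0.304352 = 0.608704). The vector `res` is a synthetic stand-in for the regression residuals.

```r
# Synthetic residuals (assumption: stand-in for model1$residuals).
set.seed(3)
res <- rnorm(1000, mean = 0, sd = 0.3)

nsigmas <- 2
center <- mean(res)
sigma <- sd(res)

# Control limits: center line plus/minus nsigmas standard deviations.
lcl <- center - nsigmas * sigma
ucl <- center + nsigmas * sigma

# Flag the points outside the limits; these are the trading signals.
below <- res < lcl
above <- res > ucl

c(LCL = lcl, UCL = ucl, n_below = sum(below), n_above = sum(above))
```

For normally distributed residuals, roughly 4.6% of the points fall outside 2-sigma limits even when nothing unusual is happening, so a 2-sigma rule produces signals fairly often.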

3.2 YKBNK-ISCTR

Now, we examine YKBNK and ISCTR. Here is the plot of the stocks from 2018 to 2020.

ykbnk_isctr_data <- wide_data %>%
  select(c(timestamp, YKBNK, ISCTR)) %>%
  filter(timestamp %within% interval("2018-01-01", "2020-01-01"))

ykbnk_isctr_data %>%
  pivot_longer(cols = c("YKBNK", "ISCTR"), names_to = "Stock", values_to = "Price") %>%
  ggplot() +
  geom_line(aes(x = timestamp, y = Price, color = Stock)) +
  labs(title = "YKBNK and ISCTR Stocks from 2018 to 2020")

Next, we build a linear regression model to predict the ISCTR price from the YKBNK price:

model2 <- lm(formula = ISCTR ~ YKBNK, data = ykbnk_isctr_data)

summary(model2)
## 
## Call:
## lm(formula = ISCTR ~ YKBNK, data = ykbnk_isctr_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32547 -0.09754 -0.00799  0.08306  0.35670 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.313005   0.009571    32.7   <2e-16 ***
## YKBNK       0.954459   0.004744   201.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1255 on 4971 degrees of freedom
## Multiple R-squared:  0.8906, Adjusted R-squared:  0.8906 
## F-statistic: 4.048e+04 on 1 and 4971 DF,  p-value: < 2.2e-16

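The linear regression model is statistically significant: both the intercept and the slope have p-values below 2e-16, and the model explains about 89% of the variance in ISCTR (R-squared 0.8906).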

Like the GARAN-AKBNK case, we can check the residuals for the pairs trading strategy. Here is the plot of the residuals:

data.frame(index = 1:length(model2$residuals), residuals = model2$residuals) %>%
  ggplot() +
  geom_point(aes(x = index, y = residuals)) + 
  labs(title= "Residuals of the linear regression model of YKBNK and ISCTR")

When we plot these residuals on an X-bar chart, we have the following:

qcc2 <- qcc(data = model2$residuals, type = "xbar.one", std.dev = "SD", nsigmas = 2)

The standard deviation of the residuals is 0.1254683, so the lower and upper 2-sigma limits are -0.2509366 and 0.2509366, respectively. The red points indicate the residuals that fall outside the limits. According to the pairs trading strategy used here, we sell ISCTR and buy YKBNK when the residual is below the LCL, and buy ISCTR and sell YKBNK when the residual is above the UCL.

Here is the profit associated with this strategy:

ykbnk_isctr_data$SELL_ISCTR_BUY_YKBNK <- model2$residuals < qcc2$limits[,"LCL"]
ykbnk_isctr_data$SELL_YKBNK_BUY_ISCTR <- model2$residuals > qcc2$limits[,"UCL"]

income <-sum(ykbnk_isctr_data %>%
               filter(SELL_ISCTR_BUY_YKBNK) %>%
               select(ISCTR)) +
  sum(ykbnk_isctr_data %>%
        filter(SELL_YKBNK_BUY_ISCTR) %>% select(YKBNK))

loss <-sum(ykbnk_isctr_data %>%
               filter(SELL_ISCTR_BUY_YKBNK) %>%
                 select(YKBNK)) +
  sum(ykbnk_isctr_data %>%
        filter(SELL_YKBNK_BUY_ISCTR) %>% 
        select(ISCTR))

income-loss
## [1] -80.3799

This time, the pairs trading strategy did not yield a positive profit. This can happen because the market dynamics cannot be modeled perfectly.

To sum up, this strategy identifies highly correlated stock pairs and models their relationship with linear regression. Control limits for trading are then determined under the assumption of constant variance, and the control chart signals when to initiate trades. In the short term, this strategy may be efficient and profitable. On the other hand, the constant variance assumption may not hold in all conditions, which can lead to wrong or inexact control limits. The strategy also depends on past correlations continuing into the future, which is not guaranteed.
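The profit figures above add up the cash flows at the signal times and never close a position. As a point of comparison, here is a minimal sketch of a round-trip mean-reversion rule on a synthetic spread. It is an illustrative alternative under stated assumptions, not the calculation used in this report: the spread is a simulated AR(1) series, a position is opened when the spread leaves the 2-sigma band and closed when it returns to the center line, and transaction costs are ignored. Note the direction: a spread below the LCL means the regressed stock is cheap relative to its fitted value, so this sketch buys it, which is the opposite of the rule stated in the text above.

```r
# Synthetic mean-reverting spread (assumption: stand-in for regression residuals).
set.seed(4)
n <- 2000
spread <- as.numeric(arima.sim(list(ar = 0.95), n = n, sd = 0.05))

center <- mean(spread)
sigma <- sd(spread)
lcl <- center - 2 * sigma
ucl <- center + 2 * sigma

position <- 0        # +1 = long the spread, -1 = short the spread, 0 = flat
entry <- NA_real_    # spread value at which the open position was entered
pnl <- numeric(0)    # profit of each closed round trip, in spread units

for (t in seq_len(n)) {
  s <- spread[t]
  if (position == 0) {
    if (s < lcl) {
      position <- 1      # spread unusually low: buy it
      entry <- s
    } else if (s > ucl) {
      position <- -1     # spread unusually high: sell it
      entry <- s
    }
  } else if ((position == 1 && s >= center) ||
             (position == -1 && s <= center)) {
    pnl <- c(pnl, position * (s - entry))  # close at the center line
    position <- 0
  }
}

# Number of closed trades and their total profit; a position still open
# at the end of the sample is ignored.
c(trades = length(pnl), total = sum(pnl))
```

Every closed trade in this sketch is profitable by construction, because the limits are computed from the same sample that is traded and a trade is closed only after the spread has reverted. Out-of-sample limits and trading costs would make the result less favorable.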

4 Task 2

4.1 GARAN - AKBNK

In this part, advanced time series analysis is conducted to model the residuals. First, we check the autocorrelation of the residuals for GARAN and AKBNK:

checkresiduals(model1)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 4906.4, df = 10, p-value < 2.2e-16

The residuals are highly autocorrelated, which violates the independence assumption of the linear regression model.
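The same diagnosis can be sketched with base R only. The example below uses the Ljung-Box test (`Box.test`) and the sample autocorrelation function (`acf`) rather than the Breusch-Godfrey test reported above, so the statistics are not directly comparable. The data are synthetic, with autocorrelated errors built in on purpose.

```r
# Synthetic regression with AR(1) errors (assumption: not the GARAN-AKBNK data).
set.seed(5)
n <- 1000
x <- cumsum(rnorm(n))
e <- as.numeric(arima.sim(list(ar = 0.9), n = n))  # autocorrelated errors
y <- 1 + 0.5 * x + e

fit <- lm(y ~ x)

# Ljung-Box test: the null hypothesis is no autocorrelation up to lag 10.
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box")

# Lag-1 sample autocorrelation of the residuals (element 1 is lag 0).
rho1 <- acf(residuals(fit), plot = FALSE)$acf[2]

c(p_value = lb$p.value, lag1_autocorrelation = rho1)
```

A very small p-value together with a lag-1 autocorrelation close to 1 is the pattern that motivates adding lagged terms to the model.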

We can improve the model by adding lagged values. We introduce GARAN’s lag 1 value to the model:

# dplyr::lag() shifts the series by one observation; the first row becomes NA
garan_akbnk_data$GARAN_LAG1 <- dplyr::lag(garan_akbnk_data$GARAN)
garan_akbnk_data
## # A tibble: 4,973 × 6
##    timestamp           GARAN AKBNK SELL_GARAN_BUY_AKBNK SELL_AKBNK_BUY_GARAN
##    <dttm>              <dbl> <dbl> <lgl>                <lgl>               
##  1 2018-01-02 06:00:00  9.20  6.95 FALSE                FALSE               
##  2 2018-01-02 07:00:00  9.32  7.06 FALSE                FALSE               
##  3 2018-01-02 08:00:00  9.34  7.10 FALSE                FALSE               
##  4 2018-01-02 09:00:00  9.32  7.08 FALSE                FALSE               
##  5 2018-01-02 10:00:00  9.33  7.10 FALSE                FALSE               
##  6 2018-01-02 11:00:00  9.34  7.14 FALSE                FALSE               
##  7 2018-01-02 12:00:00  9.32  7.12 FALSE                FALSE               
##  8 2018-01-02 13:00:00  9.34  7.12 FALSE                FALSE               
##  9 2018-01-02 14:00:00  9.33  7.12 FALSE                FALSE               
## 10 2018-01-02 15:00:00  9.32  7.11 FALSE                FALSE               
## # ℹ 4,963 more rows
## # ℹ 1 more variable: GARAN_LAG1 <dbl>
model3 <- lm(formula = GARAN ~ GARAN_LAG1 + AKBNK, data = garan_akbnk_data)
summary(model3)
## 
## Call:
## lm(formula = GARAN ~ GARAN_LAG1 + AKBNK, data = garan_akbnk_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45535 -0.02326  0.00047  0.02436  0.50531 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006367   0.005414  -1.176     0.24    
## GARAN_LAG1   0.972249   0.002557 380.294   <2e-16 ***
## AKBNK        0.038482   0.003545  10.854   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05548 on 4969 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9979, Adjusted R-squared:  0.9979 
## F-statistic: 1.158e+06 on 2 and 4969 DF,  p-value: < 2.2e-16

The model is statistically significant. We can check the residuals:

checkresiduals(model3)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 71.577, df = 10, p-value = 2.197e-11

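Introducing the lagged value reduced the autocorrelation of the residuals substantially: the Breusch-Godfrey LM statistic fell from 4906.4 to 71.577. The test still rejects the absence of serial correlation (p = 2.197e-11), so some autocorrelation remains, but we continue with this model.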
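The effect of the lagged term can be illustrated on synthetic data. In the sketch below the series `x` and `y` and all coefficients are hypothetical, and the Ljung-Box statistic from base R is used as a stand-in for the Breusch-Godfrey statistic reported above. It fits the static and the lagged regression and compares the residual autocorrelation of the two.

```r
# Synthetic pair whose spread follows an AR(1) process (assumption: not real data).
set.seed(6)
n <- 1000
x <- 7 + cumsum(rnorm(n, sd = 0.05))
e <- as.numeric(arima.sim(list(ar = 0.95), n = n, sd = 0.05))
y <- 1.3 * x + e

# Lag 1 of the response, built in base R; the first value is NA.
d <- data.frame(y = y, x = x, y_lag1 = c(NA, y[-n]))

static_fit <- lm(y ~ x, data = d)
lagged_fit <- lm(y ~ y_lag1 + x, data = d)  # row 1 is dropped (NA lag)

# Ljung-Box statistic up to lag 10: larger means more autocorrelation.
lb_static <- Box.test(residuals(static_fit), lag = 10, type = "Ljung-Box")$statistic
lb_lagged <- Box.test(residuals(lagged_fit), lag = 10, type = "Ljung-Box")$statistic

c(static = unname(lb_static), lagged = unname(lb_lagged))
round(coef(lagged_fit), 3)
```

As in the fitted model above, the lag coefficient in such a setup comes out close to 1 and the coefficient on the other stock shrinks, since most of the information in the current price is already carried by the previous price.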

Next, we plot the X-bar chart of the new model. This time, we use 3-sigma limits, because 2-sigma limits cause too many false alarms.

qcc3 <- qcc(data = model3$residuals, type = "xbar.one", std.dev = "SD", nsigmas = 3)

We follow the same procedure to calculate the profit associated with the pairs trade. The profit is:

# model3 drops the first row (NA lag), so pad the signals with FALSE to realign them with the data
garan_akbnk_data$SELL_GARAN_BUY_AKBNK <- c(FALSE, model3$residuals < qcc3$limits[,"LCL"])
garan_akbnk_data$SELL_AKBNK_BUY_GARAN <- c(FALSE, model3$residuals > qcc3$limits[,"UCL"])

income <-sum(garan_akbnk_data %>%
               filter(SELL_GARAN_BUY_AKBNK) %>%
               select(GARAN)) +
  sum(garan_akbnk_data %>%
        filter(SELL_AKBNK_BUY_GARAN) %>% select(AKBNK))

loss <-sum(garan_akbnk_data %>%
               filter(SELL_GARAN_BUY_AKBNK) %>%
                 select(AKBNK)) +
  sum(garan_akbnk_data %>%
        filter(SELL_AKBNK_BUY_GARAN) %>% 
        select(GARAN))

income-loss
## [1] 24.5917

With the pairs trade, we obtain a positive profit.
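Padding the signal vectors with a leading `FALSE` works here because only the first row is dropped for its missing lag. A more general way to keep residuals aligned with the data, sketched below on synthetic data with hypothetical column names, is to fit with `na.action = na.exclude`, so that `residuals()` returns `NA` for any dropped row.

```r
# Synthetic data with a lagged column (assumption: not the GARAN-AKBNK data).
set.seed(7)
n <- 200
d <- data.frame(x = 7 + cumsum(rnorm(n, sd = 0.05)))
d$y <- 1.3 * d$x + rnorm(n, sd = 0.1)
d$y_lag1 <- c(NA, d$y[-n])

# na.exclude keeps track of the dropped rows ...
fit <- lm(y ~ y_lag1 + x, data = d, na.action = na.exclude)

# ... so residuals() has one value per row of d, with NA where a row was dropped.
res <- residuals(fit)

# 3-sigma limits computed from the available residuals.
limits <- mean(res, na.rm = TRUE) + c(-3, 3) * sd(res, na.rm = TRUE)

# Rows with a missing residual never produce a signal.
d$below_lcl <- !is.na(res) & res < limits[1]
d$above_ucl <- !is.na(res) & res > limits[2]

head(d[, c("y", "y_lag1", "below_lcl", "above_ucl")], 3)
```

Note that `fit$residuals` is not padded in this way; the alignment comes from the `residuals()` accessor.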

4.2 YKBNK - ISCTR

First, we check the autocorrelation of the residuals of the YKBNK-ISCTR model used in Task 1:

checkresiduals(model2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 4916.7, df = 10, p-value < 2.2e-16

The residuals are highly autocorrelated.

We can improve the model by adding lag 1 of ISCTR:

# dplyr::lag() shifts the series by one observation; the first row becomes NA
ykbnk_isctr_data$ISCTR_LAG1 <- dplyr::lag(ykbnk_isctr_data$ISCTR)
ykbnk_isctr_data
## # A tibble: 4,973 × 6
##    timestamp           YKBNK ISCTR SELL_ISCTR_BUY_YKBNK SELL_YKBNK_BUY_ISCTR
##    <dttm>              <dbl> <dbl> <lgl>                <lgl>               
##  1 2018-01-02 06:00:00  2.45  2.63 FALSE                FALSE               
##  2 2018-01-02 07:00:00  2.47  2.64 FALSE                FALSE               
##  3 2018-01-02 08:00:00  2.48  2.64 FALSE                FALSE               
##  4 2018-01-02 09:00:00  2.48  2.64 FALSE                FALSE               
##  5 2018-01-02 10:00:00  2.48  2.64 FALSE                FALSE               
##  6 2018-01-02 11:00:00  2.50  2.67 FALSE                FALSE               
##  7 2018-01-02 12:00:00  2.49  2.66 FALSE                FALSE               
##  8 2018-01-02 13:00:00  2.49  2.66 FALSE                FALSE               
##  9 2018-01-02 14:00:00  2.49  2.66 FALSE                FALSE               
## 10 2018-01-02 15:00:00  2.49  2.66 FALSE                FALSE               
## # ℹ 4,963 more rows
## # ℹ 1 more variable: ISCTR_LAG1 <dbl>
model5 <- lm(formula = ISCTR ~ YKBNK + ISCTR_LAG1, data = ykbnk_isctr_data)
summary(model5)
## 
## Call:
## lm(formula = ISCTR ~ YKBNK + ISCTR_LAG1, data = ykbnk_isctr_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.122347 -0.007018 -0.000023  0.006982  0.147716 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.002666   0.001343   1.986   0.0471 *  
## YKBNK       0.007876   0.001825   4.316 1.62e-05 ***
## ISCTR_LAG1  0.991699   0.001804 549.588  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01597 on 4969 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9982, Adjusted R-squared:  0.9982 
## F-statistic: 1.401e+06 on 2 and 4969 DF,  p-value: < 2.2e-16

The model is statistically significant and has a better adjusted R-squared value than the previous model (0.9982 versus 0.8906). We continue by checking the residuals:

checkresiduals(model5)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 21.057, df = 10, p-value = 0.0207

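The autocorrelation problem of the first model decreased markedly: the LM statistic fell from 4916.7 to 21.057 (p = 0.0207), so the test no longer rejects at the 1% level, although it still does at the 5% level. We can use this model to detect the pairs trading signals, again with 3-sigma limits: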

qcc4 <- qcc(data = model5$residuals, type = "xbar.one", std.dev = "SD", nsigmas = 3)

We perform the same steps and calculate the profit:

# model5 drops the first row (NA lag), so pad the signals with FALSE to realign them with the data
ykbnk_isctr_data$SELL_ISCTR_BUY_YKBNK <- c(FALSE, model5$residuals < qcc4$limits[,"LCL"])
ykbnk_isctr_data$SELL_YKBNK_BUY_ISCTR <- c(FALSE, model5$residuals > qcc4$limits[,"UCL"])

income <-sum(ykbnk_isctr_data %>%
               filter(SELL_ISCTR_BUY_YKBNK) %>%
               select(ISCTR)) +
  sum(ykbnk_isctr_data %>%
        filter(SELL_YKBNK_BUY_ISCTR) %>% select(YKBNK))

loss <-sum(ykbnk_isctr_data %>%
               filter(SELL_ISCTR_BUY_YKBNK) %>%
                 select(YKBNK)) +
  sum(ykbnk_isctr_data %>%
        filter(SELL_YKBNK_BUY_ISCTR) %>% 
        select(ISCTR))

income-loss
## [1] -2.6059

This model also gave a negative profit, but the loss (-2.6059) is much smaller than the loss calculated in Task 1 (-80.3799).

The advanced pairs trading strategy using time series analysis is more dynamic: its control limits are computed from the residuals of a model that accounts for the recent history of the series. It reacts to changes in the market and to the evolving relationship between the stocks in a pair. The time series model also tends to produce less risky trading signals. However, if there is not much data, this strategy may overfit. In our model, we used a large amount of data to avoid this situation.

5 Comparison

Both strategies have their benefits, and the proper one should be chosen depending on conditions. In the short term, the first strategy might be more profitable; in the long term, however, because market conditions change frequently, the second strategy would be the more logical choice. In conclusion, the second method, with time series analysis, offers the possibility of improved accuracy and adaptability, while the first offers a simple approach. Both strategies have disadvantages, however, and a number of factors must be weighed carefully for either to be used successfully.